Introduction

The “wineQualityReds.csv” dataset contained 1599 observations of 13 variables. The first variable, ‘X’, is undocumented; its values run from 1 to 1599, so it appears to be a row index. The remaining 12 variables described various properties of the wines. The quality variable carried integer ratings based on the judgments of at least three wine experts; in this dataset the ratings ranged from 3 (worst) to 8 (best).

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Univariate Plots Section

The variables were first examined one at a time to get a feel for the overall distribution of the data, which informs the statistical assumptions made in later steps. Univariate analysis is also a useful way to check the quality of the data and to spot outliers.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The structure output above gives a compact view of the dataset, with one line per variable showing its type and first few values.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The summary output above gives basic statistics (minimum, quartiles, median, mean and maximum) for each variable.

On plotting the variables, we observed:

  • Density and pH were approximately normally distributed with very few outliers.
  • Fixed Acidity, Volatile Acidity, Citric Acid, Free Sulfur Dioxide, Total Sulfur Dioxide and Sulphates were positively skewed.
  • Residual Sugar and Chlorides had long-tailed distributions with many outliers.
  • Quality takes integer values, so its plot showed little detail. Most wines were rated 5 or 6, and the distribution was fairly normal.
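
The visual impressions above can be backed by a number. The following is a minimal sketch, assuming base R only, of a moment-based skewness measure. The `skewness` helper and both sample vectors are my own illustrations built from synthetic data, not the wine data, so the printed values say nothing about the actual variables.

```r
# Moment-based sample skewness (third standardized moment).
# Positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(1)
# Synthetic stand-ins, NOT the wine data:
symmetric_var <- rnorm(1000, mean = 3.3, sd = 0.15)        # roughly pH-like
right_skewed  <- rlnorm(1000, meanlog = 0.8, sdlog = 0.5)  # long right tail

skew_symmetric <- skewness(symmetric_var)
skew_right     <- skewness(right_skewed)
print(c(symmetric = skew_symmetric, right_skewed = skew_right))
```

On the real data, the same helper could be applied column by column to confirm which variables are positively skewed.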

Log transformation on skewed plots

We plotted Fixed Acidity, Volatile Acidity, Citric Acid, Free Sulfur Dioxide, Total Sulfur Dioxide and Sulphates on a log10 scale.

  • The log10 scale normalized the distributions of Fixed Acidity and Volatile Acidity.
  • On the log10 scale, Citric Acid still showed a long tail, but on the left side, whereas the untransformed distribution had its tail on the right (positive skew).
  • Free Sulfur Dioxide and Total Sulfur Dioxide had positively skewed, long-tailed distributions in the regular plots. On the log10 scale both showed far fewer outliers and looked fairly normal.
  • Sulphates had a long-tailed distribution in the regular plot. After the log transformation it too looked fairly normal.
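
To illustrate why a log10 scale helps with right-skewed variables, here is a small sketch on synthetic data (not the wine data). The helper `mean_median_gap` is a hypothetical name of mine: it measures how far the mean sits above the median, in units of standard deviation, which is a rough indicator of right skew. The last lines show a caveat for variables that contain zeros, such as citric acid, whose minimum is 0.

```r
# Rough skew indicator: (mean - median) / sd. Near 0 for symmetric data,
# positive when a long right tail pulls the mean above the median.
mean_median_gap <- function(x) (mean(x) - median(x)) / sd(x)

set.seed(42)
# Synthetic right-skewed values, NOT the wine data
raw <- rlnorm(2000, meanlog = 3.5, sdlog = 0.7)

gap_raw <- mean_median_gap(raw)         # clearly positive
gap_log <- mean_median_gap(log10(raw))  # close to zero after log10
print(c(raw = gap_raw, log10 = gap_log))

# Caveat: log10(0) is -Inf, so zero values cannot appear on a log10 axis.
with_zero <- c(0, 0.04, 0.56)           # values taken from the citric.acid column above
logged    <- log10(with_zero)
n_non_finite <- sum(!is.finite(logged)) # how many values are lost
print(n_non_finite)
```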

New variable - Total acidity

We created a new variable, total.acidity, as the sum of fixed acidity, volatile acidity and citric acid, to see whether it showed any interesting pattern or association.

Since Residual Sugar and Chlorides had a notably large number of outliers, we removed the top 5% of their values; the resulting plots looked fairly normal.
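
Both steps can be sketched in a few lines of base R. The first part uses the six rows printed at the top of this report; the second part uses synthetic values, because the full data file is not reproduced here. The object names (`wine_head`, `sugar`, `cutoff`) are illustrative, and cutting at the 95th percentile with `quantile()` is one plausible way to remove the top 5%, not necessarily the exact call used for the plots.

```r
# First six rows of the acidity columns, copied from the head() output above
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00)
)

# New variable: sum of the three acidity measures
wine_head$total.acidity <- with(
  wine_head,
  fixed.acidity + volatile.acidity + citric.acid
)
print(wine_head$total.acidity)

# Removing the top 5% of a long-tailed variable (synthetic values, NOT the wine data)
set.seed(7)
sugar  <- rlnorm(500, meanlog = 0.9, sdlog = 0.4)
cutoff <- quantile(sugar, 0.95)          # 95th percentile
sugar_trimmed <- sugar[sugar <= cutoff]  # keep the lower 95%
print(length(sugar_trimmed))
```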

Univariate Analysis

What is the structure of your dataset?

The red wine dataset consists of 1,599 observations of 12 variables (plus the index column ‘X’): 11 numeric variables describing chemical properties of the wine and one integer variable, Quality. Many of us enjoy wine without knowing the chemistry behind its quality and taste, so it is interesting to see how the quality of a wine relates to the chemical compounds present in it.

What is/are the main feature(s) of interest in your dataset?

For me, the main feature of interest is how the quality of wine correlates with the chemical parameters.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Several of the variables in this dataset are interrelated (e.g. alcohol, density, Fixed Acidity, Volatile Acidity, Citric Acid, pH), so a change in one chemical parameter can affect another. I expect these constituents, mainly alcohol and acidity, to have the dominant effect on wine quality.

Did you create any new variables from existing variables in the dataset?

I created a new variable, total.acidity, which is the sum of three variables: fixed.acidity, volatile.acidity and citric.acid.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

A log transformation replaces each value of a variable with its logarithm (base 10 in this analysis). When data are strongly skewed to the right, a log transformation can reduce the skew and bring the distribution closer to normal, which makes patterns easier to see. It does not guarantee a normal distribution. Some of the variables (e.g. Total Sulfur Dioxide, Free Sulfur Dioxide, Citric Acid, Volatile Acidity, Fixed Acidity) had positively skewed distributions. After log10 transformation, most of these looked fairly normal. Citric Acid was the exception: its positive skew on the regular plot turned into a negative skew on the log10 scale. I am not sure what this implies; one possible contributor is that Citric Acid has values at or near zero (its minimum is 0 and its first quartile is 0.09), which a log scale stretches into a long left tail. A few variables (Residual Sugar, Chlorides) had a large number of outliers. When the top 5 percent of their values was removed, the plots looked fairly normal.

Bivariate Plots Section

A bivariate correlation matrix was created to explore positive and negative associations among the variables.

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9      0.08
## 2 2           7.8             0.88        0.00            2.6      0.10
## 3 3           7.8             0.76        0.04            2.3      0.09
## 4 4          11.2             0.28        0.56            1.9      0.08
## 5 5           7.4             0.70        0.00            1.9      0.08
## 6 6           7.4             0.66        0.00            1.8      0.08
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34       1 3.51      0.56     9.4
## 2                  25                   67       1 3.20      0.68     9.8
## 3                  15                   54       1 3.26      0.65     9.8
## 4                  17                   60       1 3.16      0.58     9.8
## 5                  11                   34       1 3.51      0.56     9.4
## 6                  13                   40       1 3.51      0.56     9.4
##   quality total.acidity
## 1       5          8.10
## 2       5          8.68
## 3       5          8.60
## 4       6         12.04
## 5       5          8.10
## 6       5          8.06

The focus of this data exploration was to find chemical parameters that affect wine quality. However, the bivariate correlation matrix plot did not show any strong relationship between quality and the other variables. The pairs with positive correlations (r > 0.45) were quality and alcohol, and fixed acidity and density. The pairs with negative correlations were alcohol and density, and fixed acidity and pH. These correlations were studied further in the bivariate analysis. The large number of outliers in some variables could be one reason we did not see a strong relationship between quality and the chemical parameters. In addition, quality is an integer rating and most wines scored 5 or 6, which could also explain the lack of strong correlation between quality and the other variables.
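
As an illustration of how such a matrix can be computed and screened, the sketch below builds a small synthetic data frame (not the wine data; the column names only mimic the real ones), computes Pearson correlations with `cor()`, and lists the pairs whose absolute correlation exceeds the 0.45 threshold mentioned above. The way the toy columns are generated is an assumption chosen to produce a negative alcohol and density correlation.

```r
set.seed(3)
n <- 300

# Synthetic columns, NOT the wine data
alcohol <- rnorm(n, mean = 10.4, sd = 1)
density <- 1.0 - 0.001 * alcohol + rnorm(n, sd = 0.001)  # built to fall with alcohol
quality <- round(5.6 + 0.36 * (alcohol - 10.4) + rnorm(n, sd = 0.7))
toy <- data.frame(alcohol, density, quality)

# Pearson correlation matrix, rounded for display
corr <- round(cor(toy), 2)
print(corr)

# Variable pairs with |r| > 0.45, looking only above the diagonal
strong <- which(abs(corr) > 0.45 & upper.tri(corr), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr)[strong[, "row"]],
  var2 = colnames(corr)[strong[, "col"]],
  r    = corr[strong]
)
print(strong_pairs)
```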

According to the above plots,

  • Quality ratings were higher among wines with higher alcohol content (% by volume).
  • Density had an inverse relationship with quality: density decreased as the quality of the wine increased.
  • pH decreased (that is, acidity increased) as the wine ratings increased.
  • Residual sugar was similar among wines rated 3, 4 or 8.

The boxplots show the relationships between quality rating and pH, alcohol and density. Wines with more alcohol, lower density and higher acidity tended to be rated as higher quality.
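
The comparison behind a boxplot can also be read off as group medians. The sketch below uses only the first ten quality and alcohol values shown in the structure output earlier, so it demonstrates the mechanics and is far too small a sample to support any conclusion. The object names are illustrative.

```r
# First ten rows of quality and alcohol, copied from the str() output above
first_ten <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Median alcohol content within each quality rating
median_alcohol <- tapply(first_ten$alcohol, first_ten$quality, median)
print(median_alcohol)

# The matching base R boxplot would be:
# boxplot(alcohol ~ factor(quality), data = first_ten)
```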

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

According to the correlation matrix, graphs and boxplots, the three parameters most clearly related to quality were density, alcohol and pH/acidity. The plots showed that wines with higher alcohol content (% by volume) were rated higher in quality. Density and quality were inversely related: the lower the density, the higher the rating. Finally, the pH of the red wines ranged from about 2.7 to 4.0, and highly rated wines were on the more acidic side of the plot.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Another relationship was observed among the variables describing acidity and pH. Because fixed acidity, volatile acidity, citric acid, total acidity and pH all describe the acidic character of wine, these five variables were associated with one another (e.g. fixed acidity and pH, volatile acidity and pH, citric acid and pH, total acidity and pH). Since their association with the quality variable was not very strong in the matrix plot, these associations were not studied further.

What was the strongest relationship you found?

The strongest association involving the quality variable (the variable of interest) was with alcohol content (correlation coefficient r = 0.48, which corresponds to an R-squared of about 0.23 for a model with alcohol alone). The strongest association between any two variables was between total acidity and citric acid (r = 0.69).
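
The correlation coefficient r and R-squared are easy to confuse. For a linear model with a single predictor, R-squared equals the square of the Pearson correlation between predictor and response. The sketch below checks this on synthetic data (not the wine data) and then applies it to the reported figure for alcohol; the simulation parameters are arbitrary assumptions.

```r
set.seed(11)
n <- 500

# Synthetic predictor and response, NOT the wine data
alcohol <- rnorm(n, mean = 10.4, sd = 1.07)
quality <- 1.9 + 0.36 * alcohol + rnorm(n, sd = 0.7)

r         <- cor(alcohol, quality)       # Pearson correlation
fit       <- lm(quality ~ alcohol)       # simple linear regression
r_squared <- summary(fit)$r.squared      # R-squared of that model

# For one predictor, R-squared is exactly r squared
print(c(r = r, r_squared = r_squared, r_times_r = r^2))

# Applied to the reported correlation of 0.476 between quality and alcohol
implied_r_squared <- 0.476^2
print(round(implied_r_squared, 3))
```

The implied value of about 0.227 matches the R-squared reported for the alcohol-only model m1 in the regression table of the Multivariate Plots Section.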

Multivariate Plots Section

The alcohol and pH plots above suggest that higher alcohol content combined with low pH makes for higher-quality wines.

Wines with higher alcohol content (% by volume) and low density tended to be the high-quality wines.

The two plots above show the well-understood inverse relationship between acidity and pH. In the plot of pH against fixed acidity, no specific pattern in quality was observed. In the plot of volatile acidity against pH, high-quality wines had a pH between 3 and 3.5 and lower volatile acidity.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + density, data = wine)
## m3: lm(formula = quality ~ alcohol + density + volatile.acidity, 
##     data = wine)
## m4: lm(formula = quality ~ alcohol + density + volatile.acidity + 
##     fixed.acidity, data = wine)
## m5: lm(formula = quality ~ alcohol + density + volatile.acidity + 
##     fixed.acidity + citric.acid, data = wine)
## 
## ==========================================================================================
##                          m1            m2            m3            m4            m5       
## ------------------------------------------------------------------------------------------
##   (Intercept)           1.875***    -33.152**     -18.407        15.573        13.448     
##                        (0.175)      (10.878)      (10.298)      (15.187)      (15.198)    
##   alcohol               0.361***      0.391***      0.333***      0.311***      0.316***  
##                        (0.017)       (0.019)       (0.018)       (0.020)       (0.020)    
##   density                            34.822**      21.360*      -12.922       -10.845     
##                                     (10.813)      (10.228)      (15.214)      (15.223)    
##   volatile.acidity                                 -1.365***     -1.272***     -1.405***  
##                                                    (0.096)       (0.100)       (0.116)    
##   fixed.acidity                                                   0.045**       0.063***  
##                                                                  (0.015)       (0.017)    
##   citric.acid                                                                  -0.308*    
##                                                                                (0.137)    
## ------------------------------------------------------------------------------------------
##   R-squared             0.227         0.232         0.319         0.323         0.325     
##   adj. R-squared        0.226         0.231         0.318         0.321         0.323     
##   sigma                 0.710         0.708         0.667         0.665         0.665     
##   F                   468.267       240.693       248.893       189.939       153.348     
##   p                     0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1715.878     -1619.631     -1615.017     -1612.485     
##   Deviance            805.870       800.668       709.855       705.771       703.539     
##   AIC                3448.114      3439.757      3249.261      3242.034      3238.969     
##   BIC                3464.245      3461.265      3276.147      3274.297      3276.609     
##   N                  1599          1599          1599          1599          1599         
## ==========================================================================================

After analysing the relationships among variables with univariate, bivariate and multivariate analysis, we fitted a series of linear models. A linear model can be used to predict quality when alcohol or the other predictor values are known. Before using the regression model, we examined its statistical significance. The p-values of the overall model and of the predictor variables alcohol, fixed acidity, volatile acidity and citric acid were less than 0.05; density was not significant once fixed acidity was included (models m4 and m5).
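
The pattern of building the model up one predictor at a time, as in m1 to m5, can be sketched as follows. The data frame `wine_toy` is synthetic (not the wine data), and its generating coefficients are assumptions. Only the final line uses numbers from the table above, namely the m1 intercept and alcohol coefficient.

```r
set.seed(5)
n <- 400

# Synthetic data, NOT the wine data
wine_toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = runif(n, min = 0.2, max = 1.0)
)
wine_toy$quality <- 3 + 0.3 * wine_toy$alcohol -
  1.3 * wine_toy$volatile.acidity + rnorm(n, sd = 0.65)

# Nested models: start with alcohol, then add volatile acidity
toy_m1 <- lm(quality ~ alcohol, data = wine_toy)
toy_m2 <- update(toy_m1, . ~ . + volatile.acidity)

# Compare fit; R-squared cannot decrease when a predictor is added
r2 <- c(toy_m1 = summary(toy_m1)$r.squared,
        toy_m2 = summary(toy_m2)$r.squared)
print(round(r2, 3))
print(AIC(toy_m1, toy_m2))

# Predict quality for a hypothetical new wine
new_wine <- data.frame(alcohol = 11, volatile.acidity = 0.4)
pred <- predict(toy_m2, newdata = new_wine)
print(pred)

# Hand calculation with the reported m1 coefficients, for alcohol = 10
m1_prediction <- 1.875 + 0.361 * 10
print(m1_prediction)
```

Because R-squared never decreases when predictors are added, the adjusted R-squared, AIC and BIC rows of the table are the fairer basis for comparing m1 to m5.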

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Fixed acidity and citric acid showed one of the stronger positive associations in the bivariate matrix plot. In the above plot of citric acid against fixed acidity, with points colored by wine quality category, the high-quality wines were confined towards one side of the plot.
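
A quality category of the kind used for coloring can be derived from the integer rating with `cut()`. The bin edges and labels below are hypothetical, since the report does not state which ones were used. The example ratings are the first ten from the structure output plus three made-up values (3, 4 and 8) so that every category appears.

```r
# First ten quality ratings from the str() output, plus three made-up ratings
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 3, 4, 8)

# Hypothetical bins: 3-4 = low, 5-6 = medium, 7-8 = high.
# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8]
quality_cat <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "medium", "high")
)

category_counts <- table(quality_cat)
print(category_counts)
```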

Were there any interesting or surprising interactions between features?

To me, the surprise was the absence of any association between quality and residual sugar. Residual sugar and chlorides are considered important to the quality of wine; however, in our study their association with quality was very weak.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a linear model with alcohol, density, volatile acidity, fixed acidity and citric acid as the predictor variables and wine quality as the outcome variable. The overall R-squared of the model was quite low (0.325). However, the coefficients for alcohol, fixed acidity, volatile acidity and citric acid were statistically significant: their p-values were below the pre-determined significance level of 0.05.

The most important predictor variable in the model is alcohol. A limitation of this model is the lack of diversity in the quality variable, as more than 80% of the wines in the dataset were rated 5 or 6.

Final Plots and Summary

Plot One

For the first plot, I chose the corrplot, which gives both the big picture of the associations between variables and the individual correlation values. The color scale distinguishes positive from negative correlations.

Plot Two

Acidity plays an important role in wine quality. Fixed acidity and citric acid were positively associated and are preferred parameters in good-quality wines. In contrast, volatile acidity is considered a flaw in winemaking.

Plot Three

Among all the chemical attributes, alcohol had the strongest association with the quality rating of the wine (r = 0.476). Less dense wines (density < 1) with a higher alcohol percentage by volume were more likely to get higher quality ratings.

Reflection

This study explored the red wine dataset containing 1599 observations of 13 attributes. Of the 13 variables, 11 were chemical parameters that play an important role in wine taste and quality. The main objective of this study was to explore the relationship between quality and the chemical parameters. Using statistical methods and graphical analysis, different associations between predictor and predicted variables were studied. Despite the many variables in the dataset, only a few showed a notable relationship with quality:

  • alcohol (r = 0.48)
  • volatile acidity (r = -0.39)
  • citric acid (r = 0.23)

These variables were included in the linear model (combined R-squared = 0.3249). This low value implies that the association between the predictors and quality was not very strong: the model explains only about 32% of the variation in wine quality. By definition, R-squared is the percentage of the variation in the response variable that is explained by a linear model.

Our study showed that wines with a higher alcohol percentage and less volatile acidity were considered higher quality. In addition, wines that were more acidic and less dense were perceived as better in taste and quality.
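
For reference, the usual definition of R-squared for a linear model can be written as follows, where the y_i are the observed quality ratings, the ŷ_i are the fitted values and ȳ is the mean rating. This is the standard textbook form and is shown for orientation only.

```latex
R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}},
\qquad
SS_{\mathrm{res}} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
SS_{\mathrm{tot}} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2
```

In the model table, the Deviance row is the residual sum of squares of each linear model. As a rough consistency check, m1 and m5 imply nearly the same total sum of squares, 805.870 / (1 - 0.227) ≈ 1042.5 and 703.539 / (1 - 0.325) ≈ 1042.3, as expected for models that share the same response variable.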

Despite being a large dataset of 1599 observations, this dataset had the drawback of limited variability. The quality variable is an integer rating on a scale from 0 to 10, but its distribution was so uneven that more than 80% of the wines were rated 5 or 6. There were 10 wines rated 3, 53 wines rated 4, 199 wines rated 7 and 18 wines rated 8. Thus there were too few observations for the ratings 3, 4, 7 and 8. Because of this limitation, it was very difficult to assess the relationship between quality and the chemical parameters. The data would have been more useful and more insightful if the ratings had been more evenly distributed.

Irrespective of these limitations, this dataset was very interesting and challenging to work with. Working with so many variables provided a great opportunity to study different interactions. It would be interesting to explore the white wine dataset as well and compare the variables and linear models between the two datasets.
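
The share of wines rated 5 or 6 follows directly from the counts quoted above. The short calculation below uses only those counts and the total of 1599 observations; the variable names are illustrative.

```r
# Counts for the less common ratings, as reported above
counts_known <- c("3" = 10, "4" = 53, "7" = 199, "8" = 18)
n_total <- 1599

# Everything else was rated 5 or 6
n_5_or_6     <- n_total - sum(counts_known)
share_5_or_6 <- n_5_or_6 / n_total

print(n_5_or_6)                       # 1319
print(round(100 * share_5_or_6, 1))   # 82.5 (percent)
```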
